Cluster-preserving embedding of proteins

Authors

  • Gabriela Hristescu
  • Martin Farach-Colton
Abstract

Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search of such databases is prohibitively expensive because the databases are large and pairwise comparisons are slow. Heuristic techniques such as FASTA and BLAST are useful because they are fast and accurate, though exhaustive search has been shown to be more accurate. There are therefore times when one would like to perform an exhaustive search. We propose an efficient method, called SparseMap, for preprocessing a database of proteins to support efficient similarity searches using expensive but sensitive distance functions, such as those based on Smith-Waterman similarity. Our method is based on a low-dimensional Euclidean embedding approach. We compare our method with other embedding approaches and show that it is faster and produces embeddings that preserve more biological information about the proteins, such as pairwise distances and biological clusters.
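The abstract does not spell out how SparseMap builds its embedding, so the following is only a minimal sketch of the general idea of embedding sequences into a low-dimensional Euclidean space for search. It is not SparseMap itself. It assumes a generic reference-set construction in which each coordinate is the distance from a sequence to its closest member of a small reference set, and it uses plain edit distance as a cheap stand-in for a Smith-Waterman-based distance. All names (`edit_distance`, `embed`, `build_index`, `query`, `toy_db`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Distance = Callable[[str, str], int]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a stand-in for an expensive alignment-based distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(seq: str, reference_sets: Sequence[Sequence[str]], dist: Distance) -> List[int]:
    """One coordinate per reference set: distance to the closest member of that set."""
    return [min(dist(seq, r) for r in refs) for refs in reference_sets]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def build_index(database: List[str], n_sets: int, set_size: int,
                dist: Distance, seed: int = 0):
    """Preprocessing step: pick random reference sets and embed every sequence once."""
    rng = random.Random(seed)
    reference_sets = [rng.sample(database, set_size) for _ in range(n_sets)]
    vectors = [embed(s, reference_sets, dist) for s in database]
    return reference_sets, vectors


def query(seq: str, database: List[str], reference_sets, vectors,
          dist: Distance, k: int = 3) -> List[Tuple[float, str]]:
    """Embed the query, then rank database entries by cheap Euclidean distance."""
    q = embed(seq, reference_sets, dist)
    ranked = sorted(zip((euclidean(q, v) for v in vectors), database))
    return ranked[:k]


# Toy, made-up peptide-like strings forming two loose groups.
toy_db = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTFYIAKQR",
          "GSHMLEDPVA", "GSHMLEDPVG", "GAHMLEDPVA"]

if __name__ == "__main__":
    refs, vecs = build_index(toy_db, n_sets=4, set_size=2, dist=edit_distance)
    for d, s in query("MKTAYIAKQR", toy_db, refs, vecs, edit_distance):
        print(f"{d:.2f}  {s}")
```

In this sketch the expensive distance is evaluated only against the reference sets at query time, and the ranking over the whole database uses Euclidean distance between precomputed vectors. When the underlying distance is a metric, each coordinate differs between two sequences by at most their true distance, which is one reason embeddings of this kind can preserve neighborhood structure.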

Related resources

Visualizing Graphs as Maps with Contiguous Regions

Relational datasets, which include clustering information, can be visualized with tools such as BubbleSets, LineSets, SOM, and GMap. The countries in SOM-based and GMap-based visualizations are fragmented, i.e., they are represented by several disconnected regions. While BubbleSets and LineSets have contiguous regions, these regions may overlap, even when the input clustering is non-overlapping...

Unsupervised Shape Clustering using Diffusion Maps

The quotient space of all smooth and connected curves represented by a fixed number of boundary points is a finite-dimensional Riemannian manifold, also known as a shape manifold. This makes the preservation of locality a critically important issue when reducing the dimensionality of shapes on the manifold. We present a completely unsupervised clustering algorithm employing diffusion maps for l...

Linear-Space Algorithms for Distance Preserving Embedding

The distance preserving graph embedding problem is to embed vertices of a given weighted graph into points in 2-dimensional Euclidean space so that for each edge the distance between their corresponding endpoints is as close to the weight of the edge as possible. If the given graph is complete, that is, if distance constraints are given as a full matrix, then principal coordinate analysis can s...


Publication date: 1999